In this project I used Principal Component Analysis to examine the degree of separability between electrical recordings generated by neuron models and electrical recordings from real neurons.
If biologically realistic models were good imitations of real experimental cells, then data and models would not be easily discriminable. By projecting a 48-dimensional feature space onto a two-dimensional space, I show that a diverse pool of data and models is readily discriminated via Random Forest classification, a result that leaves even some of the most optimized models lacking. The idea is that the models most resistant to being correctly machine-classified as models (and therefore misclassified as data) serve as the better imitations/mimics of experimental data. I also used Random Forest regression to investigate which dimensions explain most of the observed variance in the feature space when experimental data inform a classifying statistical model. Variance explained will facilitate the production of a list of improvements to make to our models, in order to render them better imitations of real data.
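The overall pipeline can be sketched end to end on synthetic stand-ins for the 48-dimensional Druckman feature vectors. The array sizes and the 0.6 mean shift below are illustrative assumptions, not values from the project:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
# Stand-ins for 48-dimensional feature vectors: "data" rows and "model"
# rows drawn from slightly shifted distributions.
X_data = rng.normal(0.0, 1.0, size=(60, 48))
X_model = rng.normal(0.6, 1.0, size=(60, 48))
X = np.vstack([X_data, X_model])
y = np.array([0] * 60 + [1] * 60)  # 0 = experiment, 1 = model

# Two-dimensional projection for visual inspection.
X_2d = PCA(n_components=2).fit_transform(X)

# Cross-validated accuracy: high accuracy means models are easy to
# discriminate from data, i.e. they are poor mimics.
acc = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=5).mean()
print(X_2d.shape, round(acc, 2))
```

High cross-validated accuracy here is bad news for the models: it means the classifier can still tell them apart from the data.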
In this project you can see use of:
There is a great diversity of real biological neurons, all of which differ substantially in their electrical behavior. There are a few different classes of general-purpose neuronal models that can reproduce these different types of electrical behavior, given appropriate parameterizations.
An existing class of neuron model, the Izhikevich model, was published with parameter sets believed to make the model outputs accurately align with a variety of real biological cell outputs. Since publication, however, many very specific electrophysiological recordings have accumulated that in some ways undermine model/experiment agreement. It is now possible to constrain the Izhikevich model and find new parameterizations that more accurately reproduce recently published experimental data.
Other projects use features to separate and classify two categories of things that are hard to tell apart, so that humans can benefit from fast classification of hard-to-discern differences in high-dimensional spaces. In this project, by contrast, the goal is to use resistance to classification as an indicator of an optimization algorithm's success, and to use machine separation of data categories as an error signal that directs us to precise locations of model failure. Another way of saying this: if a good-faith attempt at machine classification is hard, then the optimization algorithm did a good job; if machine classification is very easy, the optimization algorithm did a poor job.
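One way to make "resistance to classification" concrete is to turn the cross-validated model-vs-data accuracy into a score. The helper below (`discriminability_score` is a hypothetical name, not part of the project code) rescales accuracy so that 0 means indistinguishable (a good mimic) and 1 means trivially separable:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def discriminability_score(X_model, X_data, seed=0):
    """Cross-validated model-vs-data accuracy, rescaled so that
    0.0 ~ indistinguishable (good mimic) and 1.0 ~ trivially separable."""
    X = np.vstack([X_model, X_data])
    y = np.array([1] * len(X_model) + [0] * len(X_data))
    acc = cross_val_score(RandomForestClassifier(random_state=seed), X, y, cv=5).mean()
    return max(0.0, 2.0 * (acc - 0.5))

rng = np.random.default_rng(1)
data = rng.normal(0, 1, (50, 10))
good_mimic = rng.normal(0, 1, (50, 10))   # same distribution as the "data"
poor_mimic = rng.normal(3, 1, (50, 10))   # far from the "data" cloud
s_good = discriminability_score(good_mimic, data)
s_poor = discriminability_score(poor_mimic, data)
print(round(s_good, 2), round(s_poor, 2))
```

A lower score means a better mimic, so the optimizer's job is to drive this score toward zero.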
I used the approach described herein for different research work intended for a conference abstract published as follows: J Birgiolas, R Jarvis, V Haynes, R Gerkin, SM Crook (2019) Automated assessment and comparison of cortical neuron models BMC Neuroscience 2019, 20(Suppl 1):P47
The application of TSNE to data was developed in a research team context on different data pertaining to ion channels, or the APs exclusively derived from models (as opposed to a combination of models and data). In the context of this project, I have used novel experimental data (pulled from the Allen Brain Portal API) and novel models (8 optimized cell models included), so I have re-applied a small amount of code from pre-established work, but I have made substantial novel contributions, by looking at different features, applying different feature engineering, applying Random Forest Classification, applying variance explained, and interpreting results. For a comparison to other pre-established work that informed this work check here
Before Machine Learning and analysis techniques could be applied, we needed to find optimized models. These optimized models can be understood as models that are intended to be superior mimics of real biologically derived data, as their governing equation parameters have been more rigorously constrained by a wider range of experimental data.
In order to illustrate that the optimized models are better imitations of real data, four adaptive exponential models and four Izhikevich models were each fitted to four different classes of experimental cells (see the implementation in the IPython Notebook). These eight fitted models were subsequently fed into a Druckman feature-extraction algorithm and added as data points in a dimension-reduced plot of the feature space. Many pre-existing neural models, and some Allen Brain data, were also plotted as contextual data in the same feature space.
The Druckman feature-analysis protocol originates from MATLAB code associated with the analysis of Blue Brain Project modeled cells. This feature-analysis pipeline was ported to Python by Justas Birgiolas; at a later point I made it applicable to optimized adaptive exponential and Izhikevich cells. Rick Gerkin and Vergil Haynes assisted with data cleaning, preparation, and TSNE application.
# !pip install --upgrade matplotlib
import plotly.graph_objs as go
import pickle
from sklearn import preprocessing
The highly informative figures below come from the eFEL eFeatures documentation: https://efel.readthedocs.io/en/latest/eFeatures.html#spike-shape-features
In the figures below you can see different electrical behavior corresponding to two different multi-spiking electrical experiments.

The following figure shows the difference in multi-spiking waveforms between an Izhikevich model and an adaptive exponential spiking model:

# The transformed values, ordered from highest to lowest variance dimensions
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
import plotly.express as px
import plotly.graph_objects as go
from sklearn.decomposition import PCA
#
import plotly.io as pio
GITHUB = True
if GITHUB:
    pio.renderers.default = "svg"
else:
    pio.renderers.default = "notebook"
#
%%capture
data = []
import warnings
# warnings.filterwarnings("ignore")
import pickle
import plotly
import plotly.offline as py  # note: a chart_studio import under the same alias would be shadowed here
py.init_notebook_mode(connected=True)
import seaborn
seaborn.set_style("ticks", {"xtick.major.size": 8, "ytick.major.size": 8})
#!conda install -c plotly plotly-orca
useable = pickle.load(open('optimized_multi_feature','rb'))
useable[0].tests
import pickle
import pandas as pd
pd.set_option('precision', 3)
#pd.set_option('max_columns',None)
#pd.set_option('max_rows',None)
hbp = pickle.load(open("hbp_data.p","rb"))
df_o_m3 = None
dict_ = {}
for i in hbp:
    if i is not None:
        for v in list(i.out_dic.values()):
            temp = list(i.out_dic.values())
            if temp[1] is not None:
                dict_.update(temp[1][0])
                data = pd.DataFrame(data=dict_, index=[temp[0]])
                # df_o_m3 starts as None, so test for that rather than `i == 0`
                # (`i` is an element of hbp here, not a loop counter).
                if df_o_m3 is None:
                    df_o_m3 = data
                else:
                    df_o_m3 = pd.concat([data, df_o_m3])
del df_o_m3['interburst_voltage']
df_o_m3 = pd.DataFrame.drop_duplicates(df_o_m3)
bbp_frame = df_o_m3
stable_list = list(bbp_frame.index.values);
df_o_m3;
stable_list;
df = pd.concat([df,bbp_frame])
import os
#import dask.dataframe as dd
%matplotlib inline
import pickle
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os
sns.set(font_scale=1.5)
import pandas as pd
os.getcwd()
from sklearn.preprocessing import StandardScaler
import copy
from neuronunit.optimisation.optimization_management import three_step_protocol, filtered
# Special stuff to import
from sklearn.decomposition import PCA
from sklearn.manifold import Isomap, TSNE
with open('dm_on_models.p', 'rb') as f:
    (RAW_dtc, ADEXP_dtc) = pickle.load(f)
%%capture
try:
    useable = pickle.load(open('optimized_multi_feature', 'rb'))
except Exception:
    useable = []
    for value in RAW_dtc.values():
        dtcpop = [dtc for dtc in value if dtc.rheobase is not None]
        # feature_mine: feature-extraction helper assumed to be in scope
        useable.extend(list(map(feature_mine, dtcpop)))
        # Keep only the entries that actually acquired Allen features.
        useable = [dtc for dtc in useable if hasattr(dtc, 'allen_30')]
    pickle.dump(useable, open('optimized_multi_feature', 'wb'))
For five optimized cells
df_o_m2 = None
dict_ = {}
for i, dtc in enumerate(useable):
    dict_.update(**dtc.everything)
    dict_.update(**dtc.dm_results_regular)
    data = pd.DataFrame([dict_])
    if i == 0:
        df_o_m2 = data
    else:
        df_o_m2 = pd.concat([data, df_o_m2])
df_o_m = df_o_m2
df_o_m;
The Blue Brain data frame, which contains a NaN column, may be what breaks other versions of this notebook. That data frame is called df_o_m3.
# A function to convert all cells containing array (or other things) into floats.
def f(x):
    try:
        return np.mean(x)
    except Exception:
        try:
            return np.mean(x['mean'])
        except Exception as e:
            print(x)
            raise e
# Open the 1.5x rheobase file
import os
cwd = os.getcwd()
path2data = os.path.join(cwd,'data')
filename = os.path.join(cwd,'onefive_df.pkl')
with open(filename, 'rb') as f:
    df = pickle.load(f)
# A function to convert all cells containing array (or other things) into floats.
def custom_cleaner(x):
    try:
        return np.mean(x)
    except Exception:
        try:
            return np.mean(x['pred'])
        except Exception:
            return np.mean(x['mean'])
df = df.fillna(0).applymap(custom_cleaner)
# Apply this function to each dataframe in order to convert all cells into floats.
# Also call fillna() first to impute missing values with 0's.
%time df = df.fillna(0).applymap(custom_cleaner)
#df_30x = df_30x.fillna(0).applymap(f)
df.head();
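To make the cleaning step concrete, here is a minimal, self-contained sketch of the same fillna-then-applymap pattern on a toy frame whose cells hold scalars, arrays, and `{'mean': ...}` dicts. The toy data and the `toy_cleaner` name are invented for illustration:

```python
import numpy as np
import pandas as pd

# Toy frame mimicking the mixed cell contents of the real frames:
# scalars, arrays, {'mean': ...} dicts, plus a missing value.
toy = pd.DataFrame({
    'a': [1.0, np.array([1.0, 3.0])],
    'b': [{'mean': [2.0, 4.0]}, None],
}, dtype=object)

def toy_cleaner(x):
    # Same idea as custom_cleaner: reduce whatever is in the cell to a float.
    try:
        return np.mean(x)
    except Exception:
        return np.mean(x['mean'])

# fillna(0) imputes the missing cell before the elementwise reduction.
cleaned = toy.fillna(0).applymap(toy_cleaner)
print(cleaned)
```

After this pass every cell is a plain float, which is what the scalers and classifiers downstream require.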
def mutate_frame_columns(df_everything, modifyer):
    """
    Adjust data frame column names so they fit into a universal scheme.
    """
    for c in df_everything.columns:
        if str(c) + '_1.5x' in modifyer.columns:
            df_everything[str(c) + '_1.5x'] = df_everything[c]
        if str(c) + '_3.0x' in modifyer.columns:
            df_everything[str(c) + '_3.0x'] = df_everything[c]
    # Drop every column that did not receive a protocol suffix.
    temp = df_everything
    for c in df_everything.columns:
        if '_1.5x' not in str(c) and '_3.0x' not in str(c):
            temp = temp.drop(columns=[c])
    return temp
df_o_m3 = mutate_frame_columns(df_o_m3,df)
df_o_m3.tail(5);
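The column-suffixing logic can be illustrated in isolation. The `suffix_align` helper below is a hypothetical, self-contained restatement of the same idea as `mutate_frame_columns`, shown on invented toy frames:

```python
import pandas as pd

def suffix_align(df, reference):
    """Copy each plain column to the suffixed names found in `reference`,
    then keep only the suffixed columns (same idea as mutate_frame_columns)."""
    for c in list(df.columns):
        for suffix in ('_1.5x', '_3.0x'):
            if str(c) + suffix in reference.columns:
                df[str(c) + suffix] = df[c]
    keep = [c for c in df.columns if '_1.5x' in str(c) or '_3.0x' in str(c)]
    return df[keep]

# `extra` uses plain names; `reference` already uses the suffix scheme.
extra = pd.DataFrame({'spike_width': [1.2, 1.4], 'unrelated': [0, 1]})
reference = pd.DataFrame({'spike_width_1.5x': [1.1], 'spike_width_3.0x': [1.3]})
aligned = suffix_align(extra.copy(), reference)
print(list(aligned.columns))
```

Columns without a suffixed counterpart in the reference frame (like `unrelated`) are dropped, so both frames end up speaking the same column vocabulary before concatenation.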
df = pd.concat([df,df_o_m3])
df['AP_fall_indices_1.5x'];
df = df.fillna(0)#.applymap(f)
df = df.applymap(custom_cleaner)
bbp_frame = pd.DataFrame.drop_duplicates(bbp_frame)
df.index[42]
df_everything = df
df_everything = df_everything.fillna(0)
df_everything;
df_everything.tail(15);
This serves as a table that helps us align Druckman features with high-current and low-current protocols in other feature sets.
df = df_everything
def remove_redundancy(df, removed):
    for c in removed.columns:
        if c in df.columns:
            del df[c]
    return df

with open('redundancy_removed.p', 'rb') as f:
    removed = pickle.load(f)
df = remove_redundancy(df, removed)
for c in removed.columns:
    try:
        del df[c]
    except KeyError:
        pass
df = df.fillna(0).applymap(custom_cleaner)
Finally we have prepared a pandas data frame in which the first 448 entries are Druckman measurements pertaining to voltage traces recorded in real biological cells. Appended immediately below in the same data frame are 965 Druckman measurements derived from NeuroML models. A printout of this frame follows.
df.tail(15);
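Given the row layout described above, the class labels for the classifier can be built directly from the stated counts (448 experimental rows followed by 965 model rows):

```python
import numpy as np

# Labels aligned with the stacked frame: experimental rows first,
# model-derived rows appended below (counts taken from the text above).
n_data, n_models = 448, 965
y = np.array([0] * n_data + [1] * n_models)  # 0 = data, 1 = model
print(len(y), int(y.sum()))
```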
Below we remove features that were not available or accessible in the real data, mainly negative current injection protocols such as input resistance. Several of these also rank among the features that explain the most variance.
if 'InputResistanceTest_1.5x' in df.columns:
    # Drop features unavailable in the real data,
    # in order to find out what is separating and what is not.
    del df['InputResistanceTest_1.5x']
    del df['InputResistanceTest_3.0x']
    del df['ohmic_input_resistance_1.5x']
    del df['ohmic_input_resistance_3.0x']
    del df['time_1.5x']
    # 0.190362
    del df['decay_time_constant_after_stim_3.0x']
    del df['voltage_deflection_3.0x']
    del df['steady_state_hyper_3.0x']
    del df['steady_state_voltage_stimend_3.0x']
    del df['voltage_deflection_vb_ssse_3.0x']
    del df['sag_amplitude_3.0x']
    # 0.198310
    del df['is_not_stuck_1.5x']
    del df['AHP_depth_abs_1.5x']
df.tail(15);
# Join the dataframes horizontally, so that all features coming from df_15x get a '_1.5x' suffix
# and all the ones from df_30x get a '_3.0x' suffix
#df = df_15x.join(df_30x, lsuffix='_1.5x', rsuffix='_3.0x')
def other_cleaner(x):
    if isinstance(x, dict):
        try:
            return np.mean(x['pred'])
        except Exception:
            return np.mean(x['mean'])
    return np.mean(x)
df = df_everything.fillna(0).applymap(other_cleaner)
df_everything = df
print("There are %d models+data and %d features" % df_everything.shape)
#print("There are %d models+data and %d features" % dask_frame.shape)
Let's experiment with dask, a lazy pandas-like data frame, to avoid storing the whole data frame in memory at once.
# Turn all features into Normal(0,1) variables
# Important since features all have different scales
# make model dataframe
model_idx2 = df.head(7).index.tolist()
model_no_trans_df2 = df[df.index.isin(model_idx2)]
model_no_trans_df2.index.name = 'OptCells'
model_df2 = model_no_trans_df2.copy()
model_df2.index.name = 'OptCells'
#model_df2[:]["type"] = "opt_models"
model_idx = [idx for idx in df.index.values if type(idx)==str]
model_no_trans_df = df[df.index.isin(model_idx)]
model_no_trans_df.index.name = 'Cell_ID'
model_df = model_no_trans_df.copy()
model_df.index.name = 'Cell_ID'
# make experiment dataframe
experiment_idx = [idx for idx in df.index.values if type(idx)==int]
experiment_no_trans_df = df[df.index.isin(experiment_idx)]
experiment_df = experiment_no_trans_df.copy()
model_df2
model_no_trans_df2;
experiment_df = pd.DataFrame.drop_duplicates(experiment_df)
experiment_df.groupby(experiment_df.index).first()
len(experiment_df);
model_df = pd.DataFrame.drop_duplicates(model_df)
model_df.index
model_df.groupby(model_df.index).first()
not_empty = copy.copy(df)
not_empty;
Below we highlight cell models that we know are intended to be highly representative of the data. These cell models are by author Gouwens:
Primary visual area, layer 4, spiny, cell 479728896: https://neuroml-db.org/model_info?model_id=NMLCL001508
I use PCA as a heuristic aid, to facilitate human intuition about machine classification of data versus models.
The human visual system can pick out three clusters of cell data points.
Let's show that we can classify in this low-dimensional space (using just two features). We will slowly build up to classification: first applying KMeans to visualize cluster centers, and then moving on to a random forest approach to actually visualize decision boundaries.
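As a warm-up, here is a self-contained sketch (on synthetic 2-D points, not the real projection) showing that KMeans cluster centers plus nearest-center assignment already give a classifier:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two synthetic 2-D blobs standing in for the projected feature space.
left = rng.normal([-3, 0], 0.5, (40, 2))
right = rng.normal([3, 0], 0.5, (40, 2))
pts = np.vstack([left, right])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(pts)
centers = km.cluster_centers_

# Nearest-center assignment: each point takes the label of its closest center.
d = np.linalg.norm(pts[:, None, :] - centers[None, :, :], axis=2)
labels = d.argmin(axis=1)
# This matches KMeans' own predict(), which also assigns by nearest center.
agree = (labels == km.predict(pts)).mean()
print(agree)
```

Because `KMeans.predict` is itself a nearest-center rule, the manual distance computation agrees with it exactly; the real notebook applies the same idea to the projected Druckman features.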
def remove_redundancy(df, removed):
    for c in removed.columns:
        if c in df.columns:
            del df[c]
    return df

with open('redundancy_removed.p', 'rb') as f:
    removed = pickle.load(f)
df = remove_redundancy(df, removed)
print(len(df.columns))
We can ask whether electrophysiological measurements in cells from different experimental sources are discriminable.
There is actually good agreement in the variance between Blue Brain cell data and Allen Brain cell data.
Adding in a variety of models could change this story.
df_o_m3.index;
df_o_m3 = df_o_m3.replace([np.inf, -np.inf], np.nan)#.dropna()
experiment_df = experiment_df.replace([np.inf, -np.inf], np.nan)
df_data = pd.concat([experiment_df,df_o_m3])
df1 = copy.copy(df)
for c in df1.columns:
    if np.mean(df1[c]) == 0 or np.var(df1[c]) == 0:
        print(c)
df.apply(custom_cleaner);
# bw: kernel bandwidth used for the KDE plots below
import pickle
'''
pd.set_option('precision', 3)
n_cols = 3
n_rows = int(len(df1.columns)/3)+1
fig, axes = plt.subplots(n_rows,3,figsize=(15, 3*n_rows))
for i, feature in enumerate(df1.columns):
ax = axes.flat[i]
ax.set_ylabel(str(feature)[0:10])
#df1[feature].hist(alpha=0.3, ax=ax)
n = df1.loc[model_index_labels, feature].notnull().sum()
sns.kdeplot(df1.loc[model_index_labels, feature], color='r', ax=ax, label=('%d' % n))
sns.kdeplot(df1.loc[experiment_idx_labels, feature], color='b', ax=ax)
#ax.legend()
ax.set_ylabel(feature.replace('_', '\n'))
'''
#ax.set_xlabel(units[feature])
#plt.tight_layout()
'''
n_cols = 3
n_rows = int(len(df1.columns)/3)+1
fig, axes = plt.subplots(n_rows,3,figsize=(15, 3*n_rows))
for i, feature in enumerate(df1.columns):
ax = axes.flat[i]
ax.set_ylabel(str(feature)[0:10])
#df1[feature].hist(alpha=0.3, ax=ax)
n = df1.loc[model_index_labels, feature].notnull().sum()
sns.kdeplot(df1.loc[model_index_labels, feature], color='r', ax=ax, label=('%d' % n))
sns.kdeplot(df1.loc[experiment_idx_labels, feature], color='b', ax=ax)
#ax.legend()
ax.set_ylabel(feature.replace('_', '\n'))
fig, axes = plt.subplots(int(np.sqrt(len(df1.columns))), int(np.sqrt(len(df1.columns)))+1,figsize=(15, 15))
for i, feature in enumerate(df1.columns):
axes.flat[i].set_ylabel(str(feature)[0:10])
min_ = np.min(df1[feature])
max_ = np.max(df1[feature])
axes.flat[i].set_xlim(min_,max_)
df1[feature].plot(alpha=0.3, ax=axes.flat[i])
plt.tight_layout()
'''
df.columns;
ss = StandardScaler()
df[:] = ss.fit_transform(df.values)
df.groupby(df.index).first()
df = pd.DataFrame.drop_duplicates(df)
df[:] = preprocessing.normalize(df.values, norm='l2')
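The reason for standardizing before PCA and KMeans can be shown in a few lines: features measured in, say, MOhm and ms live on wildly different scales, and StandardScaler puts each column on mean 0 and unit variance. The toy numbers below are illustrative:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Columns on wildly different scales (e.g. resistance in MOhm vs time in ms).
X = np.array([[100.0, 0.001],
              [200.0, 0.002],
              [300.0, 0.003]])
Xs = StandardScaler().fit_transform(X)
# After scaling, every column has mean ~0 and unit (population) variance.
print(Xs.mean(axis=0).round(6), Xs.std(axis=0).round(6))
```

Without this step, distance-based methods like KMeans and variance-based methods like PCA would be dominated by whichever feature happens to have the largest numeric range.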
with open('gouwens.p', 'rb') as f:
    gouwens = pickle.load(f)
gwn_check = list(gouwens.values())
gwindex = gwn_check[0][0]
gouwens_idx = [idx for idx in df.index.values if idx in gwindex]
gouwens_idx_labels = [i for i in gouwens_idx]
gouwens_idx_labels = df.index.isin(gouwens_idx)
gouwens_cells = df[df.index.isin(gouwens_idx)]
unusual = df[df.index.isin(gouwens_idx)]
unusual
print(gouwens_cells.index[-7],'a Gouwens model that well represents data')
gouwens_cells;
with open('markram.p', 'rb') as f:
    markram = pickle.load(f)
markram_check = list(markram.values())
markramindex = markram_check[0][0]
markram_idx = [idx for idx in df.index.values if idx in markramindex]
markram_idx_labels = [i for i in markram_idx]
markram_idx_labels = df.index.isin(markram_idx)
markram_cells = df[df.index.isin(markram_idx)]
markram_cells;
experiment_idx = [idx for idx in df.index.values if type(idx)==int]
model_no_trans_df = df[~df.index.isin(experiment_idx)]
experiment_idx_labels = [(i,idx) for i,idx in enumerate(df.index.values) if type(idx)==int]
#model_df
#df.labels
model_no_trans_df
experiment_idx_labels = [i[0] for i in experiment_idx_labels]
experiment_idx_labels
model_no_trans_df
model_index_labels = ~df.index.isin(experiment_idx)
model_index_labels
new_models_idx = df.head(7).index.tolist()
new_model_labels= df.index.isin(new_models_idx)
#len(new_models_idx)
nm = df.head(7)
nm;
bbp_labels = df.index.isin(stable_list)
bbpindex = df.index.isin(stable_list)
bbpindex = [i for i,idx in enumerate(df.index.values) if idx in stable_list]
# Boolean mask: True where the row index belongs to the Blue Brain (stable) list.
bbp = [idx in stable_list for idx in df.index.values]
gouwens_idx_labels
bbp_idx = [idx for idx in df.index.values if idx in stable_list]
bbp_idx_labels = [i for i in bbp_idx]
bbp_idx_labels = df.index.isin(bbp_idx_labels)
model_idx;
model_idx = [idx for idx in df.index.values if type(idx)==str and idx not in bbp_idx]
model_index_labels = df.index.isin(model_idx)
experiment_idx_labels = df.index.isin(experiment_idx)
bbp_idx_labels = [idx for idx in df.index.values if type(idx)==str and idx and idx in bbp_idx]
bbp_idx_relabels = [(i,idx) for i,idx in enumerate(df.index.values) if idx in bbp_idx]
bbp_labels = df.index.isin(stable_list)
bbpindex = df.index.isin(stable_list)#.get_loc()
df[bbpindex].index.values;
with open('traub.p', 'rb') as f:
    traub = pickle.load(f)
traub_check = list(traub.values())
traubindex = traub_check[0][0]
traub_idx = [idx for idx in df.index.values if idx in traubindex]
traub_idx_labels = df.index.isin(traub_idx)
traub_cells = df[df.index.isin(traub_idx)]
traub_cells;
df.columns;
import quantities as qt
qt.dimensionless
shape_index = { 'AP_fall_indices_1.5x':qt.dimensionless,
'AP_fall_indices_3.0x':qt.dimensionless,
'AP_rise_indices_1.5x':qt.dimensionless, 'AP_end_indices_3.0x':qt.dimensionless,
'fast_trough_index_1.5x':qt.dimensionless, 'fast_trough_index_3.0x':qt.dimensionless,
'min_AHP_indices_1.5x':qt.dimensionless, 'min_AHP_indices_3.0x':qt.dimensionless,
'peak_index_1.5x':qt.dimensionless,
'peak_index_3.0x':qt.dimensionless,
'peak_indices_1.5x':qt.dimensionless,
'peak_indices_3.0x':qt.dimensionless,
'threshold_index_1.5x':qt.dimensionless,
'threshold_index_3.0x':qt.dimensionless,
'upstroke_index_1.5x':qt.dimensionless,
'upstroke_index_3.0x':qt.dimensionless }
units = qt.ms
spike_shape = {'AP1RateOfChangePeakToTroughTest_3.0x':units,
'APlast_width_1.5x':units,
'initburst_sahp_vb_1.5x':qt.mV,
'peak_t_1.5x':units,
'peak_t_3.0x':units,
'peak_time_3.0x':units,
'sag_ratio1_3.0x':qt.mV/qt.ms,
'steady_state_hyper_1.5x':qt.mV,
'steady_state_voltage_stimend_1.5x':qt.mV,
'fast_trough_t_1.5x':qt.ms,
'fast_trough_t_3.0x':qt.ms,
'upstroke_t_1.5x':qt.ms,
'upstroke_t_3.0x':qt.ms,
'voltage_after_stim_3.0x':qt.mV,
'voltage_after_stim_1.5x':qt.mV,
'steady_state_voltage_1.5x':qt.mV,
'threshold_t_1.5x':qt.dimensionless,
'threshold_t_3.0x':qt.dimensionless
}
spike_time = {'AP2DelaySDTest_1.5x':units,
'AP2DelaySDTest_3.0x':units,
'AP_rise_time_1.5x':units,
'AP_rise_time_3.0x':units,
'ISIBurstMeanChangeTest_3.0x':units,
'ISICVTest_3.0x':units,
'ISI_log_slope_skip_3.0x':units,
'adaptation_index2_3.0x':qt.mV}
units = qt.MOhm
resistance= {'InputResistanceTest_1.5x':units,
'InputResistanceTest_3.0x':units ,
'ohmic_input_resistance_1.5x':qt.MOhm ,
'ohmic_input_resistance_3.0x':qt.MOhm ,
}
accomodation = {'AccommodationRateMeanAtSSTest_3.0x':'spikes/2 second',
'AccommodationRateToSSTest_3.0x':'spikes/2 second', 'adaptation_index2_3.0x':'spikes/2 second'}
rate = {'Spikecount_stimint_1.5x':'spike/2 Second','Spikecount_stimint_3.0x':'spike/2 Second'}
n_cols = 3
n_rows = int(len(accomodation) / 3) + 1
fig, axes = plt.subplots(1, 3, figsize=(35, 15))
for i, (feature, units) in enumerate(accomodation.items()):
    ax = axes.flat[i]
    n = df.loc[model_index_labels, feature].notnull().sum()
    sns.kdeplot(df.loc[model_index_labels, feature], color='r', ax=ax, bw=0.025)
    sns.kdeplot(df.loc[experiment_idx_labels, feature], color='b', ax=ax, bw=0.025)
    ax.set_ylabel(feature.replace('_', '\n'))
    ax.set_xlabel(units)
plt.tight_layout()
plt.title('Accommodation')
fig.savefig('accomodation.pdf', format='pdf', dpi=600)
n_cols = 3
n_rows = int(len(rate) / 3) + 1
fig, axes = plt.subplots(1, 3, figsize=(35, 15))
for i, (feature, units) in enumerate(rate.items()):
    ax = axes.flat[i]
    n = df.loc[model_index_labels, feature].notnull().sum()
    sns.kdeplot(df.loc[model_index_labels, feature], color='r', ax=ax)
    sns.kdeplot(df.loc[experiment_idx_labels, feature], color='b', ax=ax)
    ax.set_ylabel(feature.replace('_', '\n'))
    ax.set_xlabel(units)
plt.tight_layout()
plt.title('Spike Count')
fig.savefig('spike_count.pdf', format='pdf', dpi=600)
n_cols = 3
n_rows = int(len(resistance) / 3) + 1
fig, axes = plt.subplots(n_rows, n_cols, figsize=(55, 45))
bw = 0.045
for i, (feature, units) in enumerate(resistance.items()):
    ax = axes.flat[i]
    n = df.loc[model_index_labels, feature].notnull().sum()
    sns.kdeplot(df.loc[model_index_labels, feature], color='r', ax=ax, bw=bw)
    sns.kdeplot(df.loc[experiment_idx_labels, feature], color='b', ax=ax, bw=bw)
    ax.set_xlabel(units)
    ax.set_ylabel(feature.replace('_', '\n'))
    ax.legend()
plt.title('Input Resistance')
fig.savefig('input_resistance.pdf', format='pdf', dpi=600)
n_cols = 3
n_rows = int(len(spike_shape) / 3)
plt.clf()
fig, axes = plt.subplots(n_rows, n_cols, figsize=(55, 45))
for i, (feature, units) in enumerate(spike_shape.items()):
    ax = axes.flat[i]
    n = df.loc[model_index_labels, feature].notnull().sum()
    sns.kdeplot(df.loc[model_index_labels, feature], color='r', ax=ax, bw=0.045)
    sns.kdeplot(df.loc[experiment_idx_labels, feature], color='b', ax=ax, bw=0.045)
    ax.set_xlabel(units)
    ax.set_ylabel(feature.replace('_', '\n'))
plt.tight_layout()
plt.title('Spike Time')
fig.savefig('spike_shape_time.pdf', format='pdf', dpi=600)
n_cols = 3
n_rows = int(len(shape_index) / 3) + 1
plt.clf()
fig, axes = plt.subplots(n_rows, n_cols, figsize=(55, 45))
for i, (feature, units) in enumerate(shape_index.items()):
    ax = axes.flat[i]
    n = df.loc[model_index_labels, feature].notnull().sum()
    sns.kdeplot(df.loc[model_index_labels, feature], color='r', ax=ax, bw=0.045)
    sns.kdeplot(df.loc[experiment_idx_labels, feature], color='b', ax=ax, bw=0.045)
    ax.set_xlabel(units)
    ax.set_ylabel(feature.replace('_', '\n'))
plt.tight_layout()
plt.title('Relative Time Array Indices')
fig.savefig('spike_shape_index.pdf', format='pdf', dpi=600)
You can see a plot of the high-dimensional Druckman feature space projected into a low-dimensional space using rotation matrices found via regular PCA (not t-distributed stochastic neighbourhood embedding).
PCA uses the rotated covariance matrix to project the original data onto the directions where the variance in the data is maximal. One disadvantage of this approach is that the two highest-weighted eigenvectors yield synthetic dimensions, which are hard to relate back to just a few of the original Druckman dimensions.
In this way PCA and TSNE are useful data-exploration tools, but they may not always lead to a complete understanding of the data.
In order to circumvent this problem we will use the variance-explained (feature importance) output of the Random Forest classification algorithm. Random Forest variance explained will probably hint at which dimensions contribute most to the largest eigenvalues/weights of the PCA algorithm.
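A minimal sketch of the Random Forest importance idea, on synthetic data where only the first feature is informative (all values below are invented for illustration):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 200
X = rng.normal(0, 1, (n, 5))
y = (rng.random(n) < 0.5).astype(int)
# Make only feature 0 carry information about the class label.
X[:, 0] += 2.0 * y

rf = RandomForestClassifier(random_state=0).fit(X, y)
imp = rf.feature_importances_  # impurity-based importances, summing to 1
print(imp.argmax())
```

Unlike PCA's synthetic axes, these importances point directly at named input dimensions, which is what makes them useful for listing concrete model improvements.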
isomap = Isomap(n_components=2)
isomap.fit(copy.copy(df.values))
iso = isomap.embedding_.T
pca = PCA()
pca.fit(df.values)
n_features = df.shape[1]
transformed = pca.transform(copy.copy(df.values))
# Do PCA and look at variance explained
fig = plt.figure()
plt.plot(range(1,n_features+1),pca.explained_variance_ratio_.cumsum())
plt.xlim(0,50);
plt.xlabel('Number of principal components')
plt.ylabel('Fraction of variance explained');
df;
#iso = iso.T
#iso[0,experiment_idx_labels]
print(np.shape(iso))
#iso = iso.T
fig, ax = plt.subplots(figsize=(25, 25))
plt.scatter(iso[0,experiment_idx_labels],iso[1,experiment_idx_labels],c='green',cmap='rainbow',label='data')
plt.scatter(iso[0,model_index_labels],iso[1,model_index_labels],alpha=0.5,c='red',cmap='rainbow',label='models')
plt.scatter(iso[0,new_model_labels],iso[1,new_model_labels],c='orange',cmap='rainbow',label='optimized/revised models')
plt.scatter(iso[0,experiment_idx_labels][42],iso[1,experiment_idx_labels][42],s=700,c='yellow', alpha=0.3,cmap='rainbow',label='Real Data Layer 4 aspiny 313862167')
plt.scatter(iso[0,gouwens_idx_labels][-7],iso[1,gouwens_idx_labels][-7],s=700,c='purple', alpha=0.3,cmap='rainbow',label='Model layer 4 spiny 479728896')
plt.scatter(iso[0,markram_idx_labels],iso[1,markram_idx_labels],s=50,c='cyan',cmap='rainbow',label='Markram models')
plt.scatter(iso[0,gouwens_idx_labels],iso[1,gouwens_idx_labels],c='black',cmap='rainbow',label='Gouwens models')
#plt.scatter(iso[0,regular_iz_idx_labels],iso[1,regular_iz_idx_labels],c='black',cmap='rainbow',label='Gouwens models')
plt.scatter(iso[0,bbp],iso[1,bbp],s=50, c='purple',cmap='rainbow',label='BBP')
#plt.scatter(iso[0,:],iso[1,:],c='green',cmap='rainbow',label='data')
legend = ax.legend()#handles, labels, loc="upper right", title="Sizes")
# I don't love the isomap fit
def crawl_ids(url):
    """Collect model IDs from a NeuroML-DB search URL."""
    # TODO: move to aibs
    all_data = json.loads(requests.get(url).text)
    Model_IDs = []
    for d in all_data:
        Model_IDs.append(str(d['Model_ID']))
    return Model_IDs
import glob
import pickle
import json
import requests
#import get_three_feature_sets_from_nml_db as runnable_nml
#from get_three_feature_sets_from_nml_db import analyze_models_from_cache
#from neuronunit.get_three_feature_sets_from_nml_db import crawl_ids
def download_intensives():
    list_to_get = ['https://neuroml-db.org/api/search?q=traub',
                   'https://neuroml-db.org/api/search?q=markram',
                   'https://neuroml-db.org/api/search?q=Gouwens']
    regions = {}
    for url in list_to_get:
        Model_IDs = crawl_ids(url)
        for Model_ID in Model_IDs:
            url = "https://neuroml-db.org/api/model?id=" + Model_ID
            try:
                model_contents = json.loads(requests.get(url).text)
                regions[Model_ID] = model_contents['keywords'][0]['Other_Keyword_term']
            except Exception:
                pass
    with open('regions.p', 'wb') as f:
        pickle.dump(regions, f)
    return regions

regions = download_intensives()
# for k,v in regions.items(): print(k,v.split(',')[0]);
It looks like the Traub models are thalamocortical, the Markram models are somatosensory, and the Gouwens models are layer IV (Mus musculus).
# standard Normalizer
est = KMeans(n_clusters=3)
est.fit(iso.T)
y_kmeans = est.predict(iso.T)
centers = est.cluster_centers_
Another plot but with Kmeans cluster centers included. Showing the cluster centres is a first step towards showing that machine classification on the dimension reduced version of the Druckman data feature space.
In the plot below the two large yellow dots are the cluster centres for (left models), (right data). The Euclidian distnace from each data point from a cluster centre is directly proportional too which category the data point is from (ie model or data, ie red/blue). This visualization would assist us to understand using KMeans nearist neighbours classification algorithm to classify the data.
In a Random Forest Classification analysis performed much further below, we examine the dimensions that contribute the most to cluster separation by looking at variance explained. This gives us an educated guess about the dimensions that contribute the most weight to the axes of the PCA projection spaces plotted above.
It is likely that the axes in the PCA plot below are strongly aligned with "Input Resistance" in models and data, as well as "AP2RateOfChangePeakToTroughTest". This second dimension asks: considering the multi-spiking waveforms observed in models and data, how rapid is the decay from peak to trough of the second Action Potential/Spike?
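How strongly a given feature aligns with a PCA axis can be checked directly from the loadings. A minimal sketch on a synthetic 48-column matrix (the dominant column index 7 is an assumption, standing in for a feature like input resistance):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
# Hypothetical 48-feature matrix; column 7 carries far more variance
# than the rest, mimicking a dominant feature such as input resistance.
X = rng.normal(size=(200, 48))
X[:, 7] *= 20.0

pca = PCA(n_components=2).fit(X)
# The loadings (components_) give each original feature's weight on each
# projection axis; the dominant-variance column dominates the first axis.
top_feature = int(np.argmax(np.abs(pca.components_[0])))
assert top_feature == 7
```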
plt.clf()
fig = plt.figure(figsize=(20,20))
ax = plt.subplot(111)
plt.scatter(iso[0,experiment_idx_labels],iso[1,experiment_idx_labels],c='green',cmap='rainbow',label='Allen Brain Experimental Data')
plt.scatter(iso[0,model_index_labels],iso[1,model_index_labels],c='red',cmap='rainbow',label='models')
plt.scatter(iso[0,new_model_labels],iso[1,new_model_labels],c='orange',cmap='rainbow',label='optimized/revised models')
plt.scatter(iso[0,gouwens_idx_labels],iso[1,gouwens_idx_labels],c='black',cmap='rainbow',label='Gouwens models')
plt.scatter(iso[0,markram_idx_labels],iso[1,markram_idx_labels],c='cyan',cmap='rainbow',label='Markram models')
plt.scatter(iso[0,bbp],iso[1,bbp],s=90, c='purple',cmap='rainbow',label='BBP')
#plt.scatter(iso[0,:],iso[1,:],s=10, c='purple',cmap='rainbow',label='BBP')
plt.scatter(centers[0][0],centers[0][1],s=7000,c='blue', alpha=0.3, edgecolors='blue')#,label='cluster 1')
plt.scatter(centers[1][0],centers[1][1],s=7000,c='blue', alpha=0.3,edgecolors='blue')#,label='cluster 2')
plt.scatter(centers[2][0],centers[2][1],s=7000,c='blue', alpha=0.3,edgecolors='blue')#,label='cluster 3')
#plt.scatter(centers[3][0],centers[3][1],s=7000,c='blue', alpha=0.3,edgecolors='blue')#,label='cluster 3')
legend = ax.legend()#handles, labels, loc="upper right", title="Sizes")
trace7=([iso[0,bbp]],[iso[1,bbp]],'BBP',bbp)
data = []
trace0=(iso[0,experiment_idx_labels],iso[1,experiment_idx_labels],'Allen Brain Experimental Data',experiment_idx)
trace1=(iso[0,model_index_labels],iso[1,model_index_labels],'models',model_idx)
trace2=(iso[0,new_model_labels],iso[1,new_model_labels],'optimized models',new_models_idx)
trace3=(iso[0,gouwens_idx_labels],iso[1,gouwens_idx_labels],'Gouwens models',gouwens_idx)
trace4=(iso[0,markram_idx_labels],iso[1,markram_idx_labels],'Markram models',markram_idx)
trace5=([iso[0,experiment_idx_labels][42]],[iso[1,experiment_idx_labels][42]],'Layer 4 aspiny 313862167',experiment_idx_labels)
trace6=([iso[0,gouwens_idx_labels][-7]],[iso[1,gouwens_idx_labels][-7]],'Layer 4 spiny 479728896',gouwens_idx_labels)
#plt.scatter(iso[0,regular_iz_idx_labels],iso[1,regular_iz_idx_labels],c='black',cmap='rainbow',label='Gouwens models')
#trace7=([iso[0,regular_iz_idx_labels]],[iso[1,regular_iz_idx_labels]],'regular_iz_idx_labels',regular_iz_idx_labels)
#trace7=([iso[0,bbp]],[iso[1,bbp]],'BBP',df[bbpindex].index.values)
trace7=(iso[0,bbp],iso[1,bbp],'Blue Brain Project Ephys Data',df[bbpindex].index.values)
trace8=(iso[0,traub_idx_labels],iso[1,traub_idx_labels],'Traub',traub_idx_labels)
#plt.scatter(iso[0,bbp_labels],iso[1,bbp_labels],s=10, c='purple',cmap='rainbow',label='BBP')
traces = [trace0,trace1,trace2,trace3,trace4,trace5,trace6,trace7]#,trace8]
cnt=0
theme = px.colors.diverging.Portland
for i,ttt in enumerate(traces):
if cnt==len(theme):
cnt=0
if i>1:
size = 12
else:
size = 6
#if type(ttt[3]) is not type(str()):
trace = dict(
type='scatter',
text = df[df.index.isin(ttt[3])].index,
x=ttt[0],
y=ttt[1],
mode='markers',
name=ttt[2],
marker=dict(
color=theme[cnt],
size=size,
line=dict(
color='rgba(217, 217, 217, 0.14)',
width=0.5),
opacity=0.8)
)
data.append(trace)
cnt+=1
fig = go.Figure(
data=data,
layout_title_text="steady_state_voltage_stimend_3.0x versus AP1DelayMeanTest_3.0x"
)#,
fig.update_layout(
autosize=False,
width=1050,
height=1050
)
if GITHUB:
fig.show("svg")
else:
pio.show(fig)
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC
data = []
trace0=(iso[0,experiment_idx_labels],iso[1,experiment_idx_labels],'Allen Brain Experimental Data',experiment_idx)
trace1=(iso[0,model_index_labels],iso[1,model_index_labels],'models',model_idx)
trace2=(iso[0,new_model_labels],iso[1,new_model_labels],'optimized models',new_models_idx)
trace3=(iso[0,gouwens_idx_labels],iso[1,gouwens_idx_labels],'Gouwens models',gouwens_idx)
trace4=(iso[0,markram_idx_labels],iso[1,markram_idx_labels],'Markram models',markram_idx)
trace5=([iso[0,experiment_idx_labels][42]],[iso[1,experiment_idx_labels]],'Layer 4 aspiny 313862167',experiment_idx_labels)
trace6=([iso[0,gouwens_idx_labels][-7]],[iso[1,gouwens_idx_labels][-7]],'Layer 4 spiny 479728896',gouwens_idx_labels)
#plt.scatter(iso[0,regular_iz_idx_labels],iso[1,regular_iz_idx_labels],c='black',cmap='rainbow',label='Gouwens models')
#trace7=([iso[0,regular_iz_idx_labels]],[iso[1,regular_iz_idx_labels]],'regular_iz_idx_labels',regular_iz_idx_labels)
#trace7=([iso[0,bbp]],[iso[1,bbp]],'BBP',df[bbpindex].index.values)
trace7=(iso[0,bbp],iso[1,bbp],'Blue Brain Project Ephys Data',df[bbpindex].index.values)
#trace7=(iso[0,bbp],iso[1,bbp],'Blue Brain Project Ephys Data',df[bbpindex].index.values)
#plt.scatter(iso[0,bbp_labels],iso[1,bbp_labels],s=10, c='purple',cmap='rainbow',label='BBP')
traces = [trace0,trace1,trace2,trace3,trace4,trace5,trace6,trace7,None]
groundtruth = np.array(df.index.isin(experiment_idx))
classif = OneVsRestClassifier(SVC(kernel='linear'))
classif.fit(iso.T, groundtruth)
min_x = np.min(iso[0,:])
max_x = np.max(iso[0,:])
w = classif.coef_[0]
a = -w[0] / w[1]
xx = np.linspace(min_x, max_x) # make sure the line is long enough
yy = a * xx - (classif.intercept_[0]) / w[1]
cnt=4
theme = px.colors.diverging.Portland
used_columns = df.columns
pal = sns.husl_palette(8, s=.45)
lut = dict(zip(map(str, used_columns), pal))
colors = pd.Series(used_columns, index=df.columns).map(lut)
for i,ttt in enumerate(traces):
#if cnt==len(theme):
# cnt=0
size = 6
if i>1:
pass
#size = 12
else:
size = 6
text = None if ttt is None else df[df.index.isin(ttt[3])].index
if ttt is None:
trace = {
"line": {
"dash": "solid",
"color": "rgb(255,0,0)",
"shape": "linear",
"width": 2
},
"mode": "lines",
"name": "Decision Boundary",
"text": "Decision Boundary",
"type": "scatter",
"x": xx,
"y": yy,
"yaxis": "y1",
"showlegend": False
}
else:
trace = dict(
type='scatter',
text=text,
x=ttt[0],
y=ttt[1],
mode='markers',
name=ttt[2],
marker=dict(
color=colors[int(cnt*2.5)],
size=size,
line=dict(
color='rgba(217, 217, 217, 0.14)',
width=0.5),
opacity=0.8)
)
cnt+=1
data.append(trace)
fig = go.Figure(
data=data,
layout_title_text="steady_state_voltage_stimend_3.0x versus AP1DelayMeanTest_3.0x"
)#,
fig.update_layout(
autosize=False,
width=1050,
height=1050
)
if GITHUB:
fig.show("svg")
else:
pio.show(fig)
bbp_idx_labels
text = df[df.index.isin(bbp_idx_labels)].index
text;
# Do a TSNE embedding in two dimensions
tsne = TSNE(n_components=2, perplexity=30)
tsne.fit(df.values)
x = tsne.embedding_.T
plt.clf()
fig = plt.figure(figsize=(20,20))
ax = plt.subplot(111)
plt.scatter(x[0,experiment_idx_labels],x[1,experiment_idx_labels],c='blue',cmap='rainbow',label='data')
plt.scatter(x[0,model_index_labels],x[1,model_index_labels],c='red',cmap='rainbow',label='models')
plt.scatter(x[0,new_model_labels],x[1,new_model_labels],c='green',cmap='rainbow',label='optimized models')
plt.scatter(x[0,gouwens_idx_labels],x[1,gouwens_idx_labels],c='black',cmap='rainbow',label='Gouwens models')
plt.scatter(x[0,markram_idx_labels],x[1,markram_idx_labels],c='orange',s=100, cmap='rainbow',label='Markram models')
plt.scatter(x[0,experiment_idx_labels][42],x[1,experiment_idx_labels][42],s=700,c='purple', alpha=0.3,cmap='rainbow',label='Layer 4 aspiny 313862167')
plt.scatter(x[0,gouwens_idx_labels][-7],x[1,gouwens_idx_labels][-7],s=700,c='purple', alpha=0.3,cmap='rainbow',label=' layer 4 spiny 479728896')
plt.scatter(x[0,bbp],x[1,bbp],s=90, c='purple',cmap='rainbow',label='BBP')
#except:
# pass
legend = ax.legend()#handles, labels, loc="upper right", title="Sizes")
plt.show()
We can use clustermaps for two things:
which cells are clustered with other cells?
which features are clustered with other features?
Unfortunately the full clustermap is hard to read; it may be possible to truncate the dendrogram tree.
#sns.clustermap(data=df.values);
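One way to make the tree readable, sketched here on random stand-in data rather than the real `df.values`, is SciPy's `truncate_mode`, which collapses the lower branches of the dendrogram:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram

rng = np.random.default_rng(2)
X = rng.normal(size=(40, 5))  # random stand-in for df.values

Z = linkage(X, method='ward')
# truncate_mode='lastp' keeps only the last p merged clusters, collapsing
# the unreadable lower branches of a large tree into single leaves.
tree = dendrogram(Z, truncate_mode='lastp', p=6, no_plot=True)
assert len(tree['ivl']) == 6  # only six leaves remain after truncation
```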
used_columns = df.columns
# Create a categorical palette to identify the networks
pal = sns.husl_palette(8, s=.45)
lut = dict(zip(map(str, used_columns), pal))
# Convert the palette to vectors that will be drawn on the side of the matrix
#networks = df.columns.get_level_values("network")
colors = pd.Series(used_columns, index=df.columns).map(lut)
# Draw the full plot
sns.clustermap(df.corr(), center=0, cmap="vlag",
row_colors=colors, col_colors=colors,
linewidths=.75, figsize=(25, 25))
temp = df.T.corr()
temp;
#sns.clustermap(temp, figsize=(25, 25))
used_columns = df.index
# Create a categorical palette to identify the networks
pal = sns.husl_palette(8, s=.45)
lut = dict(zip(map(str, used_columns), pal))
# Convert the palette to vectors that will be drawn on the side of the matrix
#networks = df.columns.get_level_values("network")
colors = pd.Series(used_columns, index=temp.index).map(lut)
# Draw the full plot
sns.clustermap(temp, center=0, cmap="vlag",
row_colors=colors, col_colors=colors, figsize=(25,25))
'''
from plotly.figure_factory import create_dendrogram
df_ = pd.DataFrame(temp)
labels=df_.index
labels
#fig = ff.create_dendrogram(X, orientation='left', labels=names)
fig = create_dendrogram(df_,labels=df_.index, orientation='left')
fig.update_layout(width=800, height=8000)
if GITHUB:
fig.show("svg")
else:
pio.show(fig)
'''
import plotly.graph_objects as go
import plotly.figure_factory as ff
import numpy as np
from scipy.spatial.distance import pdist, squareform
# get data
#data = np.genfromtxt("http://files.figshare.com/2133304/ExpRawData_E_TABM_84_A_AFFY_44.tab",
# names=True,usecols=tuple(range(1,30)),dtype=float, delimiter="\t")
#data_array = data.view((np.float, len(data.dtype.names)))
#data_array = data_array.transpose()
#labels = data.dtype.names
df_ = pd.DataFrame(temp)
data_array = df_.values
labels = df_.index
# Initialize figure by creating upper dendrogram
fig = ff.create_dendrogram(data_array, orientation='bottom', labels=labels)
for i in range(len(fig['data'])):
fig['data'][i]['yaxis'] = 'y2'
# Create Side Dendrogram
dendro_side = ff.create_dendrogram(data_array, orientation='right')
for i in range(len(dendro_side['data'])):
dendro_side['data'][i]['xaxis'] = 'x2'
# Add Side Dendrogram Data to Figure
for data in dendro_side['data']:
fig.add_trace(data)
# Create Heatmap
dendro_leaves = dendro_side['layout']['yaxis']['ticktext']
dendro_leaves = list(map(int, dendro_leaves))
data_dist = pdist(data_array)
heat_data = squareform(data_dist)
heat_data = heat_data[dendro_leaves,:]
heat_data = heat_data[:,dendro_leaves]
heatmap = [
go.Heatmap(
x = dendro_leaves,
y = dendro_leaves,
z = heat_data,
colorscale = 'Blues'
)
]
heatmap[0]['x'] = fig['layout']['xaxis']['tickvals']
heatmap[0]['y'] = dendro_side['layout']['yaxis']['tickvals']
# Add Heatmap Data to Figure
for data in heatmap:
fig.add_trace(data)
# Edit Layout
fig.update_layout({'width':1800, 'height':1800,
'showlegend':False, 'hovermode': 'closest',
})
# Edit xaxis
fig.update_layout(xaxis={'domain': [.15, 1],
'mirror': False,
'showgrid': False,
'showline': False,
'zeroline': False,
'ticks':""})
# Edit xaxis2
fig.update_layout(xaxis2={'domain': [0, .15],
'mirror': False,
'showgrid': False,
'showline': False,
'zeroline': False,
'showticklabels': False,
'ticks':""})
# Edit yaxis
fig.update_layout(yaxis={'domain': [0, .85],
'mirror': False,
'showgrid': False,
'showline': False,
'zeroline': False,
'showticklabels': False,
'ticks': ""
})
# Edit yaxis2
fig.update_layout(yaxis2={'domain':[.825, .975],
'mirror': False,
'showgrid': False,
'showline': False,
'zeroline': False,
'showticklabels': False,
'ticks':""})
if GITHUB:
fig.show("svg")
else:
pio.show(fig)
'''
from plotly.figure_factory import create_dendrogram
df_ = pd.DataFrame(iso.T)
fig = create_dendrogram(df_)
if GITHUB:
fig.show("svg")
else:
pio.show(fig)
'''
import plotly.express as px
import plotly.graph_objects as go
data = []
trace0=(x[0,experiment_idx_labels],x[1,experiment_idx_labels],'Allen Brain Ephys Data',experiment_idx)
trace1=(x[0,model_index_labels],x[1,model_index_labels],'models',model_idx)
trace2=(x[0,new_model_labels],x[1,new_model_labels],'optimized models',new_models_idx)
trace3=(x[0,gouwens_idx_labels],x[1,gouwens_idx_labels],'Gouwens models',gouwens_idx)
trace4=(x[0,markram_idx_labels],x[1,markram_idx_labels],'Markram models',markram_idx)
trace7=(x[0,bbp],x[1,bbp],'Blue Brain Project Ephys Data',bbp_idx_labels)
trace5=([x[0,experiment_idx_labels][42]],[x[1,experiment_idx_labels][42]],'Layer 4 aspiny 313862167',experiment_idx_labels)
trace6=([x[0,gouwens_idx_labels][-7]],[x[1,gouwens_idx_labels][-7]],'Layer 4 spiny 479728896',gouwens_idx_labels)
# trace7=([iso[0,gouwens_idx_labels][-7]],[iso[1,gouwens_idx_labels][-7]],'Layer 4 spiny 479728896',gouwens_idx_labels)
#plt.scatter(iso[0,bbp_labels],iso[1,bbp_labels],s=10, c='purple',cmap='rainbow',label='BBP')
trace7=(x[0,bbp],x[1,bbp],'Blue Brain Project Ephys Data',df[bbpindex].index.values)
classif = OneVsRestClassifier(SVC(kernel='linear'))
classif.fit(x.T, groundtruth)
min_x = np.min(x[0, :])
max_x = np.max(x[0, :])
w = classif.coef_[0]
a = -w[0] / w[1]
xx = np.linspace(min_x, max_x) # make sure the line is long enough
yy = a * xx - (classif.intercept_[0]) / w[1]
used_columns = df.columns
pal = sns.husl_palette(8, s=.45)
lut = dict(zip(map(str, used_columns), pal))
colors = pd.Series(used_columns, index=df.columns).map(lut)
traces = [trace0,trace1,trace2,trace3,trace4,trace5,trace6,trace7,None]
cnt=4
theme = px.colors.diverging.Portland
for i,ttt in enumerate(traces):
if cnt==len(theme):
cnt=0
if i>1 and i!=7:
size = 12
else:
size = 6
if i==8:
trace = {
"line": {
"dash": "solid",
"color": "rgb(255,0,0)",
"shape": "linear",
"width": 2
},
"mode": "lines",
"name": "Decision Boundary",
"text": "Decision Boundary",
"type": "scatter",
"x": xx,
"y": yy,
"yaxis": "y1",
"showlegend": False
}
else:
trace = dict(
type='scatter',
text = df[df.index.isin(ttt[3])].index,
x=ttt[0],
y=ttt[1],
mode='markers',
name=ttt[2],
marker=dict(
color=theme[cnt],
size=size,
line=dict(
color='rgba(217, 217, 217, 0.14)',
width=0.5),
opacity=0.8)
)
data.append(trace)
cnt+=1
layout = go.Layout(yaxis=dict(range=[-50, 50]))
fig = go.Figure(
data=data,
layout_title_text="steady_state_voltage_stimend_3.0x versus AP1DelayMeanTest_3.0x", layout=layout
)
#layout = {'scene': {'xaxis': {'showspikes': False}}}
fig.update_layout(
autosize=False,
width=1050,
height=1050
)
if GITHUB:
fig.show("svg")
else:
pio.show(fig)
t-SNE
The t-SNE plot does a better job of spatially separating experimental data from theoretical models in the dimension-reduced Druckman feature space.
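Spatial separation in an embedding can be quantified rather than eyeballed. The sketch below uses synthetic stand-in data (two Gaussian clouds, not the real Druckman features) and scores the t-SNE embedding with the silhouette coefficient:

```python
import numpy as np
from sklearn.manifold import TSNE
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(3)
# Synthetic stand-in: 60 "data" rows and 60 "model" rows in a 48-D space.
X = np.vstack([rng.normal(0.0, 1.0, size=(60, 48)),
               rng.normal(3.0, 1.0, size=(60, 48))])
labels = np.array([True] * 60 + [False] * 60)

emb = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)
# Silhouette near 1 means the two groups are well separated in the
# embedding; near 0 means they overlap.
score = silhouette_score(emb, labels)
assert score > 0.2
```

The same score computed on the real embedding would let "better separation" be compared across PCA and t-SNE numerically.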
new_model_labels
Finally we examine the dimensions that contribute the most to cluster separation by looking at variance explained. This gives us an educated guess about the dimensions that contribute the most weight to the axes of the PCA projection spaces plotted above.
Note to self: experimental_index needs updating to include BBP cells.
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
df_models = df[~df.index.isin(experiment_idx)]
df_data = df[df.index.isin(experiment_idx)]
# Assume they have the same columns
df_combined = pd.concat([df_data, df_models])
groundtruth = np.array(df.index.isin(experiment_idx))
rfc = RandomForestClassifier()
X = df_combined.values
rfc.fit(X, groundtruth)
importances = pd.Series(index = df_combined.columns, data=rfc.feature_importances_)
groundtruth[-9:-1]
print(importances.sort_values(ascending=False)[0:9])
'''
try:
df = df.drop(columns = ['ohmic_input_resistance_3.0x'])
df = df.drop(columns = ['ohmic_input_resistance_1.5x'])
df = df.drop(columns = ['InputResistanceTest_1.5x'])
df = df.drop(columns = ['InputResistanceTest_3.0x'])
except:
pass
df = df.drop(columns = ['ohmic_input_resistance_1.5x'])
'''
#ohmic_input_resistance_3.0x 0.387
#sag_ratio1_3.0x 0.151
#ohmic_input_resistance_1.5x 0.104
#AP1RateOfChangePeakToTroughTest_3.0x 0.070
#InputResistanceTest_3.0x 0.054
df_models = df[~df.index.isin(experiment_idx)]
df_data = df[df.index.isin(experiment_idx)]
# Assume they have the same columns
df_combined = pd.concat([df_data, df_models])
groundtruth = np.array(df.index.isin(experiment_idx))
rfc = RandomForestClassifier()
X = df_combined.values
rfc.fit(X, groundtruth)
importances = pd.Series(index = df_combined.columns, data=rfc.feature_importances_)
groundtruth[-9:-1]
print(importances.sort_values(ascending=False)[0:9])
make four regions for ground truth
For a region-dependent classification with four regions (0,1,2,3):
Gouwens and Allen 0
Markram and BBP 1
Hippocampus 2
Thalamocortical 3
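A minimal sketch of building such a four-region ground-truth vector, using hypothetical ID lists in place of the real index sets (`gouwens_idx`, `experiment_idx`, and so on):

```python
import numpy as np

# Hypothetical ID lists standing in for the real index sets.
gouwens_allen = ['g1', 'a1']
markram_bbp = ['m1', 'b1']
hippocampus = ['h1']
thalamocortical = ['t1']
index = gouwens_allen + markram_bbp + hippocampus + thalamocortical

# Map every cell/model ID to its region label 0-3.
region_of = {}
for label, ids in enumerate([gouwens_allen, markram_bbp,
                             hippocampus, thalamocortical]):
    for i in ids:
        region_of[i] = label

# Four-class ground truth aligned with the (hypothetical) row order.
groundtruth4 = np.array([region_of[i] for i in index])
assert list(groundtruth4) == [0, 0, 1, 1, 2, 3]
```

The resulting integer vector can be passed to `RandomForestClassifier.fit` in place of the boolean `groundtruth` for a four-way classification.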
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
df_models = df[~df.index.isin(experiment_idx)]
df_data = df[df.index.isin(experiment_idx)]
# Assume they have the same columns
df_combined = pd.concat([df_data, df_models])
groundtruth = np.array(df.index.isin(experiment_idx))
rfc = RandomForestClassifier()
X = df_combined.values
rfc.fit(X, groundtruth)
importances = pd.Series(index = df_combined.columns, data=rfc.feature_importances_)
groundtruth[-9:-1]
print(importances.sort_values(ascending=False)[0:9])
print('InputResistanceTest_3.0x' in df_data.columns)
print('InputResistanceTest_3.0x' in df_models.columns)
print('InputResistanceTest_3.0x' in df_combined)
fig, ax = plt.subplots(figsize=(25, 25))
ax = sns.kdeplot(df['AP1RateOfChangePeakToTroughTest_3.0x'], df['sag_ratio1_3.0x'],
cmap="Blues", shade=True,bw=.15)
#plt.show()
ax = sns.kdeplot(df_models['AP1RateOfChangePeakToTroughTest_3.0x'], df_models['sag_ratio1_3.0x'],
cmap="Blues", shade=True,bw=.15)
ax = sns.kdeplot(df_data['AP1RateOfChangePeakToTroughTest_3.0x'], df_data['sag_ratio1_3.0x'],
cmap="Reds", shade=False,bw=.15)
plt.clf()
plt.scatter(df_data['InputResistanceTest_1.5x'], df_data['ohmic_input_resistance_3.0x'],label='experimental data')
plt.scatter(df_models['InputResistanceTest_1.5x'], df_models['ohmic_input_resistance_3.0x'],label='models')
plt.xlabel('InputResistanceTest_1.5x')
plt.ylabel('ohmic_input_resistance_3.0x')
plt.legend()
plt.show()
<<<<<<< HEAD
In [77]:
=======
In [67]:
>>>>>>> d1c0de820d3deff135ec0725346085492591fd0c
plt.clf()
plt.scatter(df_data['sag_ratio1_3.0x'], df_data['ohmic_input_resistance_3.0x'],label='experimental data')
plt.scatter(df_models['sag_ratio1_3.0x'], df_models['ohmic_input_resistance_3.0x'],label='models')
plt.xlabel('sag_ratio1_3.0x')
plt.ylabel('ohmic_input_resistance_3.0x')
plt.legend()
plt.show()
#from sklearn import preprocessing
#df_data_b = df[bbp]
#df_data_b[:] = preprocessing.normalize(df_data_b.values, norm='l2')
#experiment_df[:] = preprocessing.normalize(experiment_df.values, norm='l2')
#temp_bbpindex = [i for i,idx in enumerate(df_data.index.values) if idx in stable_list]
#temp_bbpindex
'''
df_data = pd.concat([experiment_df,df_data_b])
#temp_allenindex = list(range(len(temp_bbpindex),len(df_data)))
plt.clf()
plt.scatter(df_data['sag_ratio1_3.0x'], df_data['AP1RateOfChangePeakToTroughTest_3.0x'],label='BBP experimental Cells')
plt.scatter(experiment_df['sag_ratio1_3.0x'], experiment_df['AP1RateOfChangePeakToTroughTest_3.0x'],label='Allen experimental Cells')
#plt.scatter(df_o_m['steady_state_voltage_stimend_3.0x'], df_o_m['AP1DelayMeanTest_3.0x'],label='models')
plt.xlabel('sag_ratio1_3.0x')
plt.ylabel('AP1RateOfChangePeakToTroughTest_3.0x')
plt.legend()
plt.show()
'''
def crawl_ids(url):
''' move to aibs '''
all_data = requests.get(url)
all_data = json.loads(all_data.text)
Model_IDs = []
for d in all_data:
Model_ID = str(d['Model_ID'])
Model_IDs.append(Model_ID)
return Model_IDs
import glob
import pickle
import json
import requests
#import get_three_feature_sets_from_nml_db as runnable_nml
# SQL output is imported as a dataframe variable called 'df'
#import pandas as pd
#import plotly.plotly as py
import plotly.graph_objs as go
# 3D scatter plot. Resource: https://plot.ly/python/3d-scatter-plots/
theme = px.colors.diverging.Portland
experimental_data = go.Scatter3d(
x = df_data['sag_ratio1_3.0x'],
y = df_data['AP1RateOfChangePeakToTroughTest_3.0x'],
z = df_data['ohmic_input_resistance_3.0x'],
mode ='markers',
name = 'experimental data',
marker =dict(
color = theme[0],
size = 3,
opacity = 0.9
)
)
"""
opt_models = go.Scatter3d(
x = df_o_m3['AP_begin_voltage_1.5x'],
y = df_o_m3['AP1AHPDepthTest_1.5x'],
z = df_o_m3['decay_time_constant_after_stim_3.0x'],
mode ='markers',
name = 'bad Model',
marker =dict(
color = theme[1],
size = 3,
opacity = 0.9
)
)
"""
models = go.Scatter3d(
x = df_models['sag_ratio1_3.0x'],
y = df_models['AP1RateOfChangePeakToTroughTest_3.0x'],
z = df_models['ohmic_input_resistance_3.0x'],
mode ='markers',
name = 'Models',
marker =dict(
color = theme[1],
size = 3,
opacity = 0.9
)
)
data = [experimental_data , models]
layout = go.Layout(
scene = dict(xaxis = dict(title='sag_ratio1_3.0x'),
yaxis = dict(title='AP1RateOfChangePeakToTroughTest_3.0x'),
zaxis = dict(title='ohmic_input_resistance_3.0x'),),
margin=dict(
l=10,
r=10,
b=10,
t=10
)
)
fig = go.Figure(data=data, layout=layout)
#py.show(fig)
#py.plot(fig, filename='data_only_3D_scatter.html')
if GITHUB:
fig.show("svg")
else:
fig.show()
As I wrote above, Random Forest variance explained tells us the dimensions of the Druckman feature space that most strongly assist in classifying models versus data. Once variance explained identifies the features that separate models and data, we can iteratively remove the variables that contribute most heavily to the observed variance. We can keep removing the variables that explain the most variance until machine classification can no longer tell models and data apart, leaving us with a small list of tests which models need to perform better on; these tests correspond to measurable electrical properties of cells that need to be better aligned with data.
Two such measurable electrical properties, Druckman features with high variance explained, are "Input Resistance" in models and data, and "AP2RateOfChangePeakToTroughTest". This second dimension means: when considering multi-spiking waveforms, at the second Action Potential/Spike, how rapid is the average decay from peak to trough of the second AP wave. Since Action Potential attack and decay shapes are non-linear, the instantaneous gradient at the peak of the wave is not informative, and it is more useful to measure the time interval needed for the decay from a spike to a state of hyperpolarization, corresponding to a neuron's "refractory period".
Already we have arrived at useful information pertaining to the point of the exercise, as we now have a small list of electrical tests that we want optimized models to perform better on, such that models and data will be more aligned with each other.
As neural modelers with a great interest in mimicking a diverse range of experimental data using models, the least convincing aspects of our models as mimics of data are these top ten features. In other words, the least convincing aspects of our models are: AP2RateOfChangePeakToTroughTest, Input Resistance values (a scalar), and the amplitude of the first and second spike.
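The removal loop described above can be sketched as follows, on a synthetic stand-in dataset (six hypothetical features, two of them informative) rather than the real Druckman table: drop the top-importance feature until cross-validated accuracy approaches chance.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(4)
# Synthetic stand-in: six hypothetical features, only the first two informative.
y = np.array([True] * 100 + [False] * 100)
X = rng.normal(size=(200, 6))
X[:100, 0] += 3.0
X[:100, 1] += 3.0
df_toy = pd.DataFrame(X, columns=[f'feat_{i}' for i in range(6)])

# Drop the most important feature until held-out accuracy nears chance.
while df_toy.shape[1] > 1:
    rfc = RandomForestClassifier(random_state=0)
    acc = cross_val_score(rfc, df_toy.values, y, cv=5).mean()
    if acc < 0.6:  # classifier can no longer tell the groups apart
        break
    rfc.fit(df_toy.values, y)
    top = df_toy.columns[int(np.argmax(rfc.feature_importances_))]
    df_toy = df_toy.drop(columns=[top])

# The informative features are exactly the ones that had to be removed.
assert 'feat_0' not in df_toy.columns and 'feat_1' not in df_toy.columns
```

The features removed along the way are the ones a model would need to improve on to pass as data.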
Prediction Results When I Use Random Forests on all 51 feature dimensions
- remember that our ground truth labels are booleans that are defined like this:
groundtruth = np.array(df_combined.index.isin(experiment_idx))
This is labeled "True" if the data point is an experiment, and "False" if the data point is a model.
Machine classification can successfully discriminate that our optimized cells are models and not data (this is bad news for us).
In this context, in order to bolster our optimized models, a high false-negative rate is desirable. Unfortunately for us, that is not what we see. The Random Forest Classifier (RFC) correctly identifies that all 11 of the new optimized cells are not derived from experiments (they are models). That is bad news for us.
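A sketch of reading these error rates off a confusion matrix, on synthetic stand-in data. Note that under scikit-learn's convention, with `True` (experiment) as the positive class, a model mistaken for data shows up in the false-positive cell:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(5)
# Synthetic stand-in: True = experiment, False = model, as in groundtruth.
y = np.array([True] * 80 + [False] * 80)
X = rng.normal(size=(160, 10))
X[:80, 0] += 2.5  # experiments differ on one feature

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0, stratify=y)
rfc = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
pred = rfc.predict(X_te)

# With labels=[False, True] the positive class is "experiment", so a model
# mistaken for data (the outcome a good mimic wants) appears here as fp.
tn, fp, fn, tp = confusion_matrix(y_te, pred, labels=[False, True]).ravel()
models_mistaken_for_data = fp
assert tn + fp + fn + tp == len(y_te)
```

On the real feature table, counting how many optimized-model rows land in that cell measures how often the models "tricked" the classifier.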
df_o_m.columns;
rfc = RandomForestClassifier()
X = df_combined.values
X;
df_o_m.values;
> 99%
of models are classified as models when optimized models are not included.
If the random forest is trained on the optimized models too,
then the area under the ROC curve becomes closer to 1.
When we feed in the 7 new optimized models as "validation data", the RFC is still fairly good at classifying our new models correctly as models, and not data. However, RFC performance is significantly worse, as the optimized cells have tricked the RFC 4 times.
Also, since the output of the t-SNE/PCA varies with each run, as it is seeded with a pseudo-random number generator, the projection space that the RFC acts on is different each time, meaning that the FPR and TPR vary slightly on each run.
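The run-to-run variation can be removed by pinning the seed. A small sketch on random stand-in data showing that `random_state` makes the t-SNE embedding, and hence any downstream FPR/TPR computed on it, reproducible:

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(6)
X = rng.normal(size=(100, 20))  # random stand-in for the feature matrix

# Passing random_state pins the embedding, so quantities computed on the
# projection (FPR, TPR, decision boundaries) are identical across runs.
emb1 = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)
emb2 = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)
assert np.allclose(emb1, emb2)
```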
In the plot below we show the decision boundary used by our classifier.¶
The decision boundary lets us see whether the newer optimized models are classified as data or as models.
In [86]:
import plotly.express as px
import plotly.graph_objects as go
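A decision-boundary plot like the one described below can be produced by evaluating the fitted classifier on a dense grid over the 2-D projection space. A sketch with synthetic stand-in data (in the notebook, `emb` would be the t-SNE embedding and `rfc` the trained classifier):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
emb = rng.normal(size=(100, 2))       # stand-in for the 2-D t-SNE embedding
labels = emb[:, 1] > 0                # toy labels for illustration
rfc = RandomForestClassifier(random_state=0).fit(emb, labels)

# Predict over a dense grid to trace the classifier's decision regions.
xx, yy = np.meshgrid(np.linspace(-3, 3, 200), np.linspace(-3, 3, 200))
zz = rfc.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

plt.contourf(xx, yy, zz, alpha=0.3)   # shaded decision regions
plt.scatter(*emb.T, c=labels, s=10)   # points colored by true label
```

The fine-grained grid is what exposes the small "islands" of model territory discussed below, which a coarse view of the boundary would miss.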
- Blue means experimental data, red means model, as categorized by scikit-learn's Random Forest classifier.
A general trend is apparent at the macro scale: the bottom two-thirds of the plane belong to data, and the upper third belongs to models. On a micro scale, however, there are many small pockets, or islands, of model decision territory inside what is more generally regarded as data territory.
Although the large red dots appear on the data side of the macro decision boundary, zooming in reveals that each optimized model is in fact enveloped by a model island just small enough to contain it. Random forest classification therefore correctly classifies the optimized models as models.
The above figure allows us¶
To argue that the newer optimized models are closer to cluster centres and often fall on the experimental data side of the decision boundary.
Let's switch back to using all 38 features to classify¶
Only because it's easier to debug, and I can make progress more quickly.
Using cross validation below you can see that this approach is generalizable.
In [87]:
from sklearn import metrics
In [88]:
#!pip install --upgrade scikit-learn
len(groundtruth)
groundtruth
df_combined.head()
Out[88]:
In [89]:
from sklearn.model_selection import train_test_split
for i in range(10):
    # vary random_state so each iteration evaluates a different 50/50 split
    X_train, X_test, y_train, y_test = train_test_split(
        df_combined.values, groundtruth, test_size=0.5, random_state=i)
    rfc = RandomForestClassifier()
    rfc.fit(X_train, y_train)
    print(rfc.score(X_test, y_test))  # held-out accuracy for this split
All seven new models are classified as data¶
That is, they are not perceived as models; they are perceived as data.
This will enable us to use a cross-validation approach.¶
Cross-validation helps us check the generalizability of our model by better navigating the bias-variance tradeoff.
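A compact way to run this is sklearn's `cross_val_score`, sketched here with synthetic features and labels (in the notebook this would take `df_combined.values` and `groundtruth`):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 5))          # stand-in feature matrix
y = rng.integers(0, 2, size=60)       # stand-in boolean-style labels

# 5-fold stratified CV: each fold's accuracy estimates out-of-sample performance.
scores = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=5)
print(scores.mean())
```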
Report on misclassification.¶
Even though an RFC trained on all the data across 148 features can over-fit, a false negative is still sometimes reported.
The experimental data point with identifier "482764620" is sometimes falsely classified as a model, although we know from ground truth that it is an experiment.
The outputs of this test differ slightly on each run.
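Counts of such misclassifications can be read off a confusion matrix. A minimal sketch with hypothetical labels (not the notebook's actual predictions):

```python
from sklearn.metrics import confusion_matrix

y_true = [True, True, False, False, True]   # ground truth: True = experiment
y_pred = [True, False, False, False, True]  # hypothetical classifier output

# With labels ordered [False, True], ravel() yields (tn, fp, fn, tp).
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(fn)  # → 1  (one experiment misclassified as a model)
```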
Repeat above with just experimental data¶
In [90]:
# make model dataframe
model_idx = [idx for idx in df.index.values if type(idx)==str]
model_no_trans_df = df[df.index.isin(model_idx)]
model_no_trans_df.index.name = 'Cell_ID'
model_df = model_no_trans_df.copy()
model_df.index.name = 'Cell_ID'
# make experiment dataframe
experiment_idx = [idx for idx in df.index.values if type(idx)==int]
experiment_no_trans_df = df[df.index.isin(experiment_idx)]
experiment_df = experiment_no_trans_df.copy()
experiment_df;
In [91]:
model_df[:] = ss.fit_transform(model_no_trans_df.values);
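This standardization step (assuming `ss` is an sklearn `StandardScaler`, as used earlier in the notebook) rescales each feature column to zero mean and unit variance, sketched here with a tiny synthetic matrix:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

ss = StandardScaler()
# Two samples, two features on very different scales.
Z = ss.fit_transform(np.array([[1.0, 10.0], [3.0, 30.0]]))
print(Z)  # each column now has zero mean and unit variance
```

Standardizing matters here because t-SNE and PCA are distance-based: without it, large-magnitude features would dominate the embedding.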
In [92]:
model_no_trans_df.head();
In [93]:
model_df.head();
In [94]:
# Play around with the value of perplexity. Recommended values are between 5 and 50.
# See if any of the clusters that pop out mean anything (do they match existing cell-type clusters?
# Do they match known firing patterns?
# Where does the data fit in when it is also plotted in this space?)
perplexities = [25, 30, 35, 40]
df = model_df.copy()
fig, ax = plt.subplots(2, 2)
ax = ax.ravel()
for i, perp in enumerate(perplexities):
    # init = 'pca' or 'random'
    tsne = TSNE(n_components=2,
                init='random',
                random_state=0,
                perplexity=perp,  # default = 30, should be less than the number of samples
                n_iter=1000)      # default = 1000
    %time tsne.fit(df.values)  # can't use transpose
    ax[i].scatter(*tsne.embedding_.T)
    if i in [2, 3]:
        ax[i].set_xlabel('TSNE-1')
    if i in [0, 2]:
        ax[i].set_ylabel('TSNE-2')
    ax[i].set_title('Perplexity = %s' % perp)
In [95]:
try:
    os.mkdir('data')
except FileExistsError:
    pass  # directory already exists
filename = os.path.join(path2data,'new_cortical_ephys.csv')
model_df.to_csv(filename)
filename = os.path.join(path2data,'new_cortical_ephys_no_trans.csv')
model_no_trans_df.to_csv(filename)
filename = os.path.join(path2data,'experiment_ephys_no_trans.csv')
experiment_no_trans_df.to_csv(filename)